
V18 Manifold-Guided Architecture — val_bpb 0.434 #663

Open
raahilg wants to merge 2 commits into openai:main from raahilg:main

Conversation


@raahilg raahilg commented Mar 25, 2026

Standard language models must simultaneously construct an internal representation of token relationships and learn to navigate that representation to make predictions. We separate these two jobs.

By precomputing a physics-simulated token manifold from corpus co-occurrence statistics, we freeze the geometric structure directly into the architecture. The model's job changes from construction + navigation to just navigation — a much easier task that lets the weights specialize entirely on exploiting the geometric prior rather than building it from scratch.

The result is essentially a GNN operating on a precomputed token interaction graph — the manifold defines graph topology, sparsemax produces edge weights, and hop cells perform node updates with message passing. Every architecture decision is chosen to exploit this geometric prior: sparsemax routing along manifold geodesics, spectral-coordinate-conditioned attention, entropy-guided message passing, and parallel transport across the token manifold.

With only 1024 tokens, the full pairwise statistics are trivially computable — the manifold captures essentially the complete statistical structure of the language. Normally, a model would need to rediscover these patterns through gradient descent. We hand them to a 20M parameter model on initialization.
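
Sparsemax is what makes the routing in this description sparse rather than soft: it projects scores onto the probability simplex, so low-scoring neighbors get exactly zero edge weight. A minimal NumPy sketch of the standard sparsemax operator (Martins & Astudillo, 2016) — not the submission's implementation:

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection of scores z onto the probability simplex.

    Unlike softmax, the output can contain exact zeros, so a token only
    routes along manifold edges whose scores clear the threshold tau.
    """
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)             # cumulative sums of sorted scores
    support = 1 + k * z_sorted > cssv      # candidates for the support set
    k_z = k[support][-1]                   # support size k(z)
    tau = (cssv[support][-1] - 1.0) / k_z  # simplex threshold
    return np.maximum(z - tau, 0.0)

# Nearby tokens share the mass; distant ones get exactly zero weight.
p = sparsemax(np.array([2.0, 1.6, 0.1, -1.0]))  # → [0.7, 0.3, 0.0, 0.0]
```

The output still sums to 1 like a softmax, but only the top-scoring edges participate in message passing.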

raahilg and others added 2 commits March 24, 2026 20:21
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Parameter Golf submission (“V18 Manifold-Guided Architecture + Sparsemax Routing”) including the training script, run logs for two seeds, and the submission metadata/README describing results.

Changes:

  • Introduces a new train_gpt.py implementing manifold construction + sparsemax-routed multi-hop message passing + manifold-guided attention, with int8+zlib export and roundtrip eval.
  • Adds training logs for seed 42 and seed 27 runs (including post-quant BPB).
  • Adds submission.json and a README documenting the approach and reported metrics.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_gpt.py | New training/manifold/quantization script for the V18 submission. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_seed42.log | Seed 42 training record and final int8+zlib roundtrip metrics. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_seed27.log | Seed 27 training record and final int8+zlib roundtrip metrics. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/submission.json | Submission metadata (score, size, blurb, author/date). |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/README.md | Write-up of the method, results table, and run instructions. |


```python
if not files:
    raise FileNotFoundError(f"No files found for pattern: {pattern}")
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
```

Copilot AI Mar 25, 2026


load_validation_tokens can return a too-short tensor when the validation split has fewer than seq_len + 1 tokens (or when seq_len is set too large). That leads to total_seqs == 0 in eval_val and a divide-by-zero when computing val_loss/val_bpb. Add the same usable <= 0 guard the baseline scripts use and raise a clear ValueError when the validation set is too short for the configured sequence length.

Suggested change

```diff
 usable = ((tokens.numel() - 1) // seq_len) * seq_len
+if usable <= 0:
+    raise ValueError(
+        f"Validation set too short for seq_len={seq_len}: "
+        f"only {tokens.numel()} tokens available."
+    )
```

Comment on lines +501 to +616
```python
# Physics simulation (runs on rank 0's GPU, broadcast result)
torch.cuda.empty_cache()
log0(f"\n  Physics simulation ({args.physics_dim}D, {args.physics_steps} steps)...")
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
pos = torch.nn.Parameter(torch.randn(V, args.physics_dim, device=device) * 0.1)
opt_sim = torch.optim.Adam([pos], lr=0.05)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sim, args.physics_steps, eta_min=0.0005)

src_t = torch.tensor(np.concatenate([rows, cols]), dtype=torch.long, device=device)
dst_t = torch.tensor(np.concatenate([cols, rows]), dtype=torch.long, device=device)
sw_t = torch.tensor(np.concatenate([spring_w, spring_w]), dtype=torch.float32, device=device)
mass_t = torch.tensor(entropic_mass, dtype=torch.float32, device=device)
asym_t = torch.tensor(asymmetry, dtype=torch.float32, device=device)
dsrc_t = torch.tensor(np.concatenate([dir_rows, dir_cols]), dtype=torch.long, device=device)
ddst_t = torch.tensor(np.concatenate([dir_cols, dir_rows]), dtype=torch.long, device=device)
dw_t = torch.tensor(np.concatenate([dir_w_vals, dir_w_vals]), dtype=torch.float32, device=device)
n_rep = min(80000, V*(V-1)//2)
sf = (V*(V-1)//2) / n_rep
# CPU RNG for deterministic physics across different GPU hardware
phys_rng = torch.Generator()
phys_rng.manual_seed(12372)
t0 = time.time()

for step in range(args.physics_steps):
    opt_sim.zero_grad()
    n_ss = min(200000, len(src_t))
    si = torch.randint(0, len(src_t), (n_ss,), generator=phys_rng).to(device)
    d = pos[src_t[si]] - pos[dst_t[si]]
    E_spring = (len(src_t)/n_ss) * torch.sum(sw_t[si] * torch.sum(d**2, dim=1))

    ri = torch.randint(0, V, (n_rep,), generator=phys_rng).to(device)
    rj = torch.randint(0, V-1, (n_rep,), generator=phys_rng).to(device)
    rj = rj + (rj >= ri).long()
    E_rep = sf * torch.sum(1.0 / torch.norm(pos[ri]-pos[rj], dim=1).clamp(min=1e-4))

    ai_idx = torch.where(asym_t > asym_t.median())[0]
    n_ap = min(2000, len(ai_idx)*(len(ai_idx)-1)//2)
    if n_ap > 0 and len(ai_idx) > 1:
        ai = ai_idx[torch.randint(0, len(ai_idx), (n_ap,), generator=phys_rng).to(device)]
        aj = ai_idx[torch.randint(0, len(ai_idx), (n_ap,), generator=phys_rng).to(device)]
        mk = ai != aj; ai, aj = ai[mk], aj[mk]
        E_torsion = 0.5 * torch.sum(
            asym_t[ai]*asym_t[aj] / torch.norm(pos[ai]-pos[aj], dim=1).clamp(min=1e-4)
        ) if len(ai) > 0 else torch.tensor(0.0, device=device)
    else:
        E_torsion = torch.tensor(0.0, device=device)

    gi = torch.randint(0, V, (n_rep,), generator=phys_rng).to(device)
    gj = torch.randint(0, V-1, (n_rep,), generator=phys_rng).to(device)
    gj = gj + (gj >= gi).long()
    E_grav = -sf * 0.1 * torch.sum(
        mass_t[gi]*mass_t[gj] / torch.norm(pos[gi]-pos[gj], dim=1).clamp(min=1e-4))

    if len(dsrc_t) > 0:
        n_ds = min(100000, len(dsrc_t))
        di = torch.randint(0, len(dsrc_t), (n_ds,), generator=phys_rng).to(device)
        dd = pos[dsrc_t[di]] - pos[ddst_t[di]]
        E_dir = 0.3 * (len(dsrc_t)/n_ds) * torch.sum(dw_t[di] * torch.sum(dd**2, dim=1))
    else:
        E_dir = torch.tensor(0.0, device=device)

    (E_spring + E_rep + E_torsion + E_grav + E_dir).backward()
    torch.nn.utils.clip_grad_norm_([pos], 10.0)
    opt_sim.step(); sched.step()
    if step % 1000 == 0:
        log0(f"  physics step {step} ({time.time()-t0:.0f}s)")

positions = pos.detach().cpu().numpy()
del pos, opt_sim, sched
torch.cuda.empty_cache()
log0(f"  Physics done ({time.time()-t0:.0f}s)")

# Hessian eigendecomposition
log0(f"  Computing Hessian...")
coupling = np.zeros((V, V), dtype=np.float32)
for k in range(len(rows)):
    w = float(spring_w[k])
    coupling[rows[k], cols[k]] += 2 * w
    coupling[cols[k], rows[k]] += 2 * w
coupling += np.outer(entropic_mass, entropic_mass) * 0.1
for k in range(len(dir_rows)):
    v = float(dir_w_vals[k]) * 0.3
    coupling[dir_rows[k], dir_cols[k]] += v
    coupling[dir_cols[k], dir_rows[k]] += v
chunk = 256
for i in range(0, V, chunk):
    ie = min(i+chunk, V)
    diff = positions[i:ie, None, :] - positions[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    d = np.maximum(d, 1e-8)
    coupling[i:ie] += (1.0 / (d**3)).astype(np.float32)
np.fill_diagonal(coupling, 0)

coupling_t = torch.from_numpy(coupling.astype(np.float64))
evals_all, evecs_all = torch.linalg.eigh(coupling_t)
idx_ = torch.argsort(evals_all, descending=True)[:args.hessian_modes]
evals = evals_all[idx_].numpy()
evecs = evecs_all[:, idx_].numpy()
# Fix eigh sign ambiguity — make largest element in each column positive
for i in range(evecs.shape[1]):
    if evecs[np.argmax(np.abs(evecs[:, i])), i] < 0:
        evecs[:, i] *= -1
hessian_coords = (evecs * np.sqrt(np.abs(evals))[None, :]).astype(np.float32)
log0(f"  Hessian: {hessian_coords.shape}, eigenvalues: {evals[0]:.2f} → {evals[-1]:.2f}")

dir_scale = 0.5 * np.std(hessian_coords) / (np.std(directional_coords) + 1e-8)
syn_scale = 0.3 * np.std(hessian_coords) / (np.std(syntactic_coords) + 1e-8)
combined = np.concatenate([
    hessian_coords,
    directional_coords[:, -32:] * dir_scale,
    syntactic_coords[:, -32:] * syn_scale,
], axis=1).astype(np.float32)

log0(f"  Manifold ready: {combined.shape} ({time.time()-t_total:.0f}s total)")
return combined
```
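
The eigh sign fix in this hunk matters for reproducibility: eigenvectors are only defined up to sign, so different BLAS backends may return v or -v for the same matrix. A small self-contained check (not from the submission) showing that the convention maps either choice to the same spectral coordinates:

```python
import numpy as np

def fix_signs(evecs: np.ndarray) -> np.ndarray:
    """Pin the eigh sign ambiguity: make each column's largest-magnitude
    entry positive, so spectral coordinates are backend-independent."""
    evecs = evecs.copy()
    for i in range(evecs.shape[1]):
        if evecs[np.argmax(np.abs(evecs[:, i])), i] < 0:
            evecs[:, i] *= -1
    return evecs

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # any symmetric matrix
_, vecs = np.linalg.eigh(A)
# vecs and -vecs are equally valid eigenbases; the convention collapses both:
assert np.allclose(fix_signs(vecs), fix_signs(-vecs))
```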

Copilot AI Mar 25, 2026


build_manifold_distributed says the physics simulation “runs on rank 0's GPU, broadcast result”, but the code currently runs the physics simulation + Hessian eigendecomposition on every rank. In multi-GPU runs this duplicates the most expensive work and can blow the wallclock budget. Consider gating the physics/Hessian section with if rank == 0, then broadcasting the resulting combined manifold coordinates (e.g., via dist.broadcast on a tensor) to other ranks.
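
A minimal sketch of the gating this comment suggests, assuming a `build_fn` that returns the (V, coord_dim) float32 manifold array; the helper name and shapes are illustrative, not from the PR:

```python
import numpy as np
import torch
import torch.distributed as dist

def build_manifold_rank0(build_fn, V: int, coord_dim: int, device, rank: int) -> np.ndarray:
    """Run the expensive physics/Hessian build only on rank 0, then broadcast.

    build_fn is assumed to return a (V, coord_dim) float32 numpy array.
    """
    if rank == 0:
        combined = torch.from_numpy(build_fn()).to(device)
    else:
        # Other ranks allocate an empty buffer to receive rank 0's result.
        combined = torch.empty(V, coord_dim, dtype=torch.float32, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.broadcast(combined, src=0)  # NCCL/Gloo broadcast from rank 0
    return combined.cpu().numpy()
```

In a single-process run the broadcast is skipped and rank 0's result is returned directly, so the guard costs nothing there.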

Comment on lines +18 to +19
```python
import subprocess
import sys
```

Copilot AI Mar 25, 2026


There are unused imports (subprocess, sys) in this file. Removing them reduces lint noise and avoids implying there are subprocess/system side effects.

Suggested change

```diff
-import subprocess
-import sys
```

```python
import torch.distributed as dist
import torch.nn.functional as F
from torch import Tensor, nn
from torch.nn.parallel import DistributedDataParallel as DDP
```

Copilot AI Mar 25, 2026


DistributedDataParallel as DDP is imported but not used (the script implements manual reduce/broadcast instead). Dropping the unused import will avoid confusion about whether DDP is expected here.

Suggested change

```diff
-from torch.nn.parallel import DistributedDataParallel as DDP
```

Comment on lines +38 to +41
```python
import os
```

Copilot AI Mar 25, 2026


os is imported twice (near the top and again here). Removing the duplicate import keeps the module header tidy.

Suggested change

```diff
-import os
```

```diff
@@ -0,0 +1,60 @@
+# V18 Manifold-Guided Architecture — val_bpb: 0.438
```

Copilot AI Mar 25, 2026


The README title reports val_bpb: 0.438, while submission.json reports val_bpb: 0.4343. If 0.438 is meant to be the mean across seeds, consider updating the title to say “mean val_bpb” (or update it to the best/official score) to avoid ambiguity.

Suggested change

```diff
-# V18 Manifold-Guided Architecture — val_bpb: 0.438
+# V18 Manifold-Guided Architecture — mean val_bpb: 0.438
```
